HPar: A Practical Parallel Parser for HTML – Taming HTML Complexities for Parallel Parsing

Authors

  • Zhijia Zhao
  • Michael Bebenita
  • Dave Herman
  • Jianhua Sun
  • Xipeng Shen
Abstract

Parallelizing HTML parsing is challenging because of the complexity of HTML documents and the inherent dependences in the HTML parsing algorithm. As a result, despite numerous studies of parallel parsing, HTML parsing remains sequential today. It is one of the final barriers to fully parallelizing browser operations and thereby minimizing the browser’s response time, an important factor in user experience, especially on portable devices. This paper provides a comprehensive analysis of the special complexities of parallel HTML parsing and presents a systematic exploration of how to overcome those difficulties through specially designed speculative parallelization. To the best of our knowledge, this work develops the first pipelining and data-level parallel HTML parsers. The data-level parallel parser, named HPar, achieves up to 2.4x speedup on quad-core devices. This work demonstrates for the first time that efficient parallel HTML parsing is feasible, and offers a set of novel insights for parallel HTML parsing.

Similar resources

Parallel Parsing: The Earley and Packrat Algorithms

Parsing plays a critical role in our modern computer infrastructure: scripting languages such as Python and JavaScript, layout languages such as HTML, CSS, and Postscript/PDF, and data exchange languages such as XML and JSON are all interpreted, and so require parsing. Moreover, by some estimates, the time spent parsing while producing a rendered page from HTML, CSS, and JavaScript is as much a...

Static Validation of Dynamically Generated HTML Documents Based on Abstract Parsing and Semantic Processing

Abstract parsing is a static-analysis technique for a program that, given a reference LR(k) context-free grammar, statically checks whether or not every dynamically generated string output by the program conforms to the grammar. The technique operates by applying an LR(k) parser for the reference language to data-flow equations extracted from the program, immediately parsing all the possible st...

PAPAGENO: A Parallel Parser Generator for Operator Precedence Grammars

In almost all language processing applications, languages are parsed employing classical algorithms (such as the LR(1) parsers generated by Bison), which are sequential due to their left-to-right state-dependent nature. Although early theoretical studies on parallel parsing algorithms delineated potential speedups on abstract parallel machines using a data-parallel approach, practical developme...

Language-Independent Text Parsing of Arbitrary HTML-Documents. Towards A Foundation For Web Genre Identification

This article describes an approach to parsing and processing arbitrary web pages in order to detect macrostructural objects such as headlines, explicitly- and implicitly-marked lists, and text blocks of different types. The text parser analyses a document by means of several processing stages and inserts the analysis results directly into the DOM tree in the form of XML elements and attributes, s...

XSS-FP: Browser Fingerprinting using HTML Parser Quirks

There are many scenarios in which inferring the type of a client browser is desirable, for instance to fight against session stealing. This is known as browser fingerprinting. This paper presents and evaluates a novel fingerprinting technique to determine the exact nature (browser type and version, e.g., Firefox 15) of a web browser, exploiting HTML parser quirks exercised through XSS. Our experim...

Publication date: 2013